Abstract

The Boltzmann machine provides a useful framework to learn highly complex, multimodal and multiscale 
data distributions that occur in the real world. The default method to learn its parameters consists of minimizing 
the Kullback-Leibler (KL) divergence from training samples to the Boltzmann model. We propose in this work 
a novel approach for Boltzmann training which assumes that a meaningful metric between observations is given. 
This metric can be represented by the Wasserstein distance between distributions, for which we derive a gradient 
with respect to the model parameters. Minimization of this new Wasserstein objective leads to generative models 
that are better when considering the metric and that have a cluster-like structure. We demonstrate the practical 
potential of these models for data completion and denoising, for which the metric between observations plays a 
crucial role. 


1 Introduction 

Boltzmann machines [?] are powerful generative models that can be used to approximate a large class of real-world
data distributions, such as handwritten characters [7], speech segments [?], or multimodal data [?]. Boltzmann
machines share similarities with neural networks in their capability to extract features at multiple scales, and
to build well-generalizing hierarchical data representations [?, ?]. The restricted Boltzmann machine (RBM)
is a special type of Boltzmann machine defining a probability distribution over a set of $d$ binary observable
variables whose state is represented by the vector $x \in \{0,1\}^d$, and a set of $h$ explanatory variables, also binary.
The distribution of the RBM can always be written in marginalized form as

$$p_\theta(x) = \frac{1}{Z_\theta} e^{-F_\theta(x)},$$

where the function $F_\theta(x)$ is called the free energy and is parameterized by a vector of parameters $\theta$. $Z_\theta$ is
called the partition function and normalizes the distribution $p_\theta$ to 1. Given an empirical probability distribution
$\hat p(x) = \frac{1}{N} \sum_{n=1}^{N} \delta_{x_n}(x)$, where $(x_n)_n$ is a list of $N$ observations in $\{0,1\}^d$, an RBM can be
trained using information-theoretic divergences (see for example [?]) by minimizing with respect to $\theta$ a divergence
$\Delta(\hat p, p_\theta)$ between the sample empirical measure $\hat p$ and the modeled distribution $p_\theta$:

$$\min_{\theta \in \Theta} \Delta(\hat p, p_\theta). \tag{1}$$

When $\Delta$ is for instance the KL divergence, this approach results in the well-known Maximum Likelihood Estimator
(MLE), which yields gradients for $\theta$ of the form

$$\nabla_\theta \mathrm{KL}(\hat p \,\|\, p_\theta) = \langle \nabla_\theta F_\theta(x) \rangle_{\hat p} - \langle \nabla_\theta F_\theta(x) \rangle_{p_\theta}, \tag{2}$$

where the bracket notation $\langle \cdot \rangle_p$ indicates an expectation with respect to $p$. The KL gradient involves the mean of
the gradient of $F_\theta$ evaluated on observations, contrasted by its expectation under $p_\theta$. Alternative choices for $\Delta$ are
the Bhattacharyya/Hellinger and Euclidean distances between distributions, or more generally F-divergences or
M-estimators [?]. They all result in comparable gradient terms that try to adjust $\theta$ so that the fitting terms $p_\theta(x_n)$
grow as large as possible.

Figure 1: Empirical distribution $\hat p(x)$ (gray) defined on the set of states $\{0,1\}^d$ with $d = 3$ shown next to two
possible modeled distributions defined on the same set of states. The size of the circles indicates the probability mass
allocated to each state. The first modeled distribution $p_\theta(x)$ (blue) has low KL divergence and high Wasserstein
distance from the empirical distribution. The second one $p_{\theta'}(x)$ (red) has high KL divergence and low Wasserstein
distance, and thus incorporates the desired metric.

* Also with the Department of Brain and Cognitive Engineering, Korea University.

We explore in this work a different scenario: what if $\theta$ is chosen so that $p_\theta(x)$ is large, on average, when $x$ is
close to a data point $x_n$ in some sense, but not necessarily when $x$ coincides exactly with $x_n$? To adopt such a
geometric criterion, we must first define what closeness between observations means. In almost all applications of
Boltzmann machines, such a metric between observations is readily available: One can for example consider the
Hamming distance between binary vectors, or any other metric motivated by practical considerations.¹ This being
done, the geometric criterion we have drawn can be materialized by considering for $\Delta$ the Wasserstein distance
[18] (a.k.a. the Kantorovich or the earth mover's distance [12]) between measures. This choice was considered
in theory by [2], who proved its statistical consistency, but was never considered practically to the best of our
knowledge. This paper describes a practical derivation for a minimum Kantorovich distance estimator [10] for
Boltzmann machines, which can scale up to tens of thousands of observations. As will be described in this paper,
recent advances in the fast approximation of Wasserstein distances [4] and their derivatives [?] play an important
role in the practical implementation of these computations.

Before describing this approach in detail, we would like to stress that measuring goodness-of-fit with the
Wasserstein distance results in a considerably different perspective than that provided by a Kullback-Leibler/MLE
approach. This difference is illustrated in Figure 1, where a probability $p_\theta$ can be close from a KL perspective to
a given empirical measure $\hat p$, but far from the same measure $\hat p$ in the Wasserstein sense. Conversely, a different
probability $p_{\theta'}$ can miss the mark from a KL viewpoint but achieve a low Wasserstein distance to $\hat p$.

2 A Practical Framework for Minimum Wasserstein Distance Estimation 

Consider two probabilities $p, q$ in $\mathcal{P}(\mathcal{X})$, the set of probabilities on $\mathcal{X} = \{0,1\}^d$. Namely, two maps
$p, q : \mathcal{X} \to \mathbb{R}_+$ such that $\sum_x p(x) = \sum_x q(x) = 1$, where we omit $x \in \mathcal{X}$ under the summation sign. Consider
a distance function $D : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ which satisfies for any triplet $x, x', x'' \in \mathcal{X}$ the triangle inequality
$D(x, x'') \le D(x, x') + D(x', x'')$ and $D(x, x) = 0$. Given a constant $\gamma \ge 0$, the $\gamma$-smoothed Wasserstein
distance [?] is equal to

$$W_\gamma(p, q) = \min_{\pi \in \Pi(p, q)} \langle D(x, x') \rangle_\pi - \gamma H(\pi), \tag{3}$$

where $\Pi(p, q)$ is the set of joint probabilities $\pi$ on $\mathcal{X} \times \mathcal{X}$ such that $\sum_{x'} \pi(x, x') = p(x)$, $\sum_{x} \pi(x, x') = q(x')$,
and $H(\pi) = -\sum_{x x'} \pi(x, x') \log \pi(x, x')$ is the Shannon entropy of $\pi$. This optimization problem, a strictly
convex program, has an equivalent dual formulation [?] which involves instead two real-valued functions $\alpha, \beta$ on
$\mathcal{X}$ and which plays an important role in this paper:

$$W_\gamma(p, q) = \max_{\alpha, \beta} \; \langle \alpha(x) \rangle_p + \langle \beta(x') \rangle_q - \gamma \sum_{x x'} e^{\frac{1}{\gamma}\left(\alpha(x) + \beta(x') - D(x, x')\right) - 1}. \tag{4}$$

¹ When using the MLE principle, metric considerations play a key role to define densities $p_\theta$, e.g. the reliance of
Gaussian densities on Euclidean distances. This is the kind of metric we take for granted in this work.


Smooth Wasserstein Distances. The “true” Wasserstein distance corresponds to the case where $\gamma = 0$, when
Equation (3) is stripped of the entropic term. The reader will easily verify that it matches the usual linear program
used to describe Wasserstein/EMD distances [12]. When $\gamma \to 0$ in Equation (4), one also recovers the Kantorovich
dual formulation, because the rightmost regularizer converges to the indicator function of the feasible set of the
dual optimal transport problem, $\alpha(x) + \beta(x') \le D(x, x')$. We consider in this paper the case $\gamma > 0$ because
it was shown in [4] to considerably facilitate computations, and in [?] to result in a divergence $W_\gamma(p, q)$ which,
unlike the case $\gamma = 0$, is differentiable w.r.t. the first variable. Looking at the dual formulation in Equation (4),
one can see that this gradient is equal to $\alpha^\star$, the centered optimal dual variable (the centering step for $\alpha^\star$ ensures
the orthogonality with respect to the simplex constraint).

Sensitivity analysis gives a clear interpretation to the quantity $\alpha^\star(x)$: It measures the cost for each unit of
mass placed by $p$ at $x$ when computing the Wasserstein distance $W_\gamma(p, q)$. To decrease $W_\gamma(p, q)$, it might thus be
favorable to transfer mass in $p$ from points where $\alpha^\star(x)$ is high to points where $\alpha^\star(x)$ is low. This idea
can be used, by a simple application of the chain rule, to minimize, given a fixed target probability $p$, the quantity
$W_\gamma(p_\theta, p)$ with respect to $\theta$.

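Proposition 1. Let $p_\theta(x) = \frac{1}{Z_\theta} e^{-F_\theta(x)}$ be a parameterized family of probability distributions where $F_\theta(x)$ is a
differentiable function of $\theta \in \Theta$ and we write $G_\theta = \langle \nabla_\theta F_\theta(x) \rangle_{p_\theta}$. Let $\alpha^\star$ be the centered optimal dual solution of
$W_\gamma(p_\theta, p)$ as in Equation (4). The gradient of the smoothed Wasserstein distance with respect to $\theta$ is given by

$$\nabla_\theta W_\gamma(p_\theta, p) = \langle \alpha^\star(x) \rangle_{p_\theta} G_\theta - \langle \alpha^\star(x) \nabla_\theta F_\theta(x) \rangle_{p_\theta}. \tag{5}$$

Proof. This result is a direct application of the chain rule: We have

$$\nabla_\theta W_\gamma(p_\theta, p) = \left( \frac{\partial p_\theta}{\partial \theta} \right)^{T} \frac{\partial W_\gamma(p_\theta, p)}{\partial p_\theta}.$$

As mentioned in [?], the rightmost term is the optimal dual variable (the Kantorovich potential)
$\partial W_\gamma(p_\theta, p) / \partial p_\theta = \alpha^\star$. The Jacobian $(\partial p_\theta / \partial \theta)$ is a linear map. For a given $x'$,

$$\partial p_\theta(x') / \partial \theta = p_\theta(x') G_\theta - \nabla_\theta F_\theta(x') \, p_\theta(x').$$

As a consequence, $(\partial p_\theta / \partial \theta)^{T} \alpha^\star$ is the integral w.r.t. $x'$ of the term above multiplied by $\alpha^\star(x')$, which results in
Equation (5). □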
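The expression for the derivative of $p_\theta$ used in the proof can be checked numerically on a toy model. The sketch
below is only an illustration under stated assumptions: it uses a hypothetical free energy $F_\theta(x) = -\theta^T x$ on
$\{0,1\}^2$ (not a model considered in this paper), enumerates all states, and compares the closed form
$p_\theta(x') G_\theta - \nabla_\theta F_\theta(x') p_\theta(x')$ with central finite differences. All function names are illustrative.

```python
import itertools
import math

# Toy illustration (assumption): F_theta(x) = -theta . x on {0,1}^d with small d.
def free_energy(theta, x):
    return -sum(t * xi for t, xi in zip(theta, x))

def grad_free_energy(theta, x):
    # Gradient of F_theta(x) with respect to theta for this toy model.
    return [-float(xi) for xi in x]

def model_probs(theta, states):
    weights = [math.exp(-free_energy(theta, x)) for x in states]
    Z = sum(weights)  # partition function obtained by enumeration
    return [w / Z for w in weights]

def analytic_jacobian(theta, states):
    """One row per state x': p(x') * G - grad F(x') * p(x')."""
    p = model_probs(theta, states)
    grads = [grad_free_energy(theta, x) for x in states]
    G = [sum(pi * g[k] for pi, g in zip(p, grads)) for k in range(len(theta))]
    return [[pi * G[k] - g[k] * pi for k in range(len(theta))]
            for pi, g in zip(p, grads)]

def numeric_jacobian(theta, states, eps=1e-6):
    rows = [[0.0] * len(theta) for _ in states]
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        p_up, p_dn = model_probs(up, states), model_probs(dn, states)
        for i in range(len(states)):
            rows[i][k] = (p_up[i] - p_dn[i]) / (2 * eps)
    return rows

states = list(itertools.product([0, 1], repeat=2))
theta = [0.3, -1.2]
J_analytic = analytic_jacobian(theta, states)
J_numeric = numeric_jacobian(theta, states)
max_gap = max(abs(a - b) for ra, rb in zip(J_analytic, J_numeric)
              for a, b in zip(ra, rb))
print(max_gap)  # largest disagreement between the two Jacobians
```

Each column of the Jacobian sums to zero over the states, which reflects the fact that $p_\theta$ stays normalized when
$\theta$ varies.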

Comparison with the KL Fitting Error. The target distribution $p$ plays a direct role in the formation of the
gradient of $\mathrm{KL}(p \,\|\, p_\theta)$ w.r.t. $\theta$ through the term $\langle \nabla_\theta F_\theta(x) \rangle_p$ in Equation (2). The Wasserstein gradient incorporates
the knowledge of $p$ in a different way, by considering, on the support of $p_\theta$ only, points $x$ that correspond to high
potentials (costs) $\alpha^\star(x)$ when computing the distance of $p_\theta$ to $p$. A high potential at $x$ means that the probability
$p_\theta(x)$ should be lowered if one were to decrease $W_\gamma(p_\theta, p)$, by varying $\theta$ accordingly.

Sampling Approximation. The gradient in Equation (5) is intractable, since it involves solving an optimal
(smoothed) transport problem over probabilities defined on $2^d$ states. In practice, we replace expectations w.r.t.
$p_\theta$ by an empirical distribution formed by sampling from the model $p_\theta$ (e.g. the PCD sample [16]). Given a sample
$(\tilde x_n)_n$ of size $\tilde N$ generated by the model, we define $\hat p_\theta = \sum_{n=1}^{\tilde N} \delta_{\tilde x_n} / \tilde N$. The tilde is used to differentiate the
sample generated by the model from the empirical observations. Because the dual potential $\alpha^\star$ is centered and $\hat p_\theta$
is a measure with uniform weights, $\langle \alpha^\star(x) \rangle_{\hat p_\theta} = 0$, which simplifies the approximation of the gradient to

$$\nabla_\theta W_\gamma(p_\theta, \hat p) \approx -\frac{1}{\tilde N} \sum_{n=1}^{\tilde N} \hat\alpha^\star(\tilde x_n) \, \nabla_\theta F_\theta(\tilde x_n), \tag{6}$$

where $\hat\alpha^\star$ is the solution of the discrete smooth Wasserstein dual between the two empirical distributions $\hat p$ and $\hat p_\theta$,
which have respectively supports of size $N$ and $\tilde N$. In practical terms, $\hat\alpha^\star$ is a vector of size $\tilde N$, one coefficient for
each PCD sample, which can be computed by following the algorithm below [?]. To keep notations simple, we
describe it in terms of generic probabilities $p$ and $q$, having in mind these are in practice the simulated and training
empirical measures $\hat p_\theta$ and $\hat p$.

Computing $\alpha^\star$. When $\gamma > 0$, the optimal variable $\alpha^\star$ corresponding to $W_\gamma(p, q)$ can be recovered through
the Sinkhorn algorithm with a cost which grows as the product $|p||q|$ of the sizes of the support of $p$ and $q$, where
$|p| = \sum_x \mathbf{1}_{p(x) > 0}$. The algorithm is well known but we adapt it here to our setting, see [5, Alg. 3] for a more precise
description. To ease notations, we consider an arbitrary ordering of $\mathcal{X}$, a set of cardinality $2^d$, and identify its elements
with indices $1 \le i \le 2^d$. Let $I = (i_1, \dots, i_{|p|})$ be the ordered family of indices in the set $\{i \mid p(i) > 0\}$ and define
$J$ accordingly for $q$. $I$ and $J$ have respective lengths $|p|$ and $|q|$. Form the matrix $K = [e^{-D(i_a, j_b)/\gamma}]_{ab}$ of size
$|p| \times |q|$. Choose now two positive vectors $u \in \mathbb{R}^{|p|}_{++}$ and $v \in \mathbb{R}^{|q|}_{++}$ at random, and repeat until $u, v$ converge
in some metric the operations $u \leftarrow p/(Kv)$, $v \leftarrow q/(K^T u)$, where $p$ and $q$ are restricted to $I$ and $J$ and the
divisions are elementwise. Upon convergence, the optimal variable $\alpha^\star$ is zero everywhere except for
$\alpha^\star(i_a) = \gamma \log(u_a / \tilde u)$, where $1 \le a \le |p|$ and $\tilde u$ is the geometric mean of vector $u$ (which
ensures that $\alpha^\star$ is centered).
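The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration rather
than the implementation used in the experiments. It assumes strictly positive weight vectors already restricted to
their supports, uses the kernel $K_{ab} = e^{-D_{ab}/\gamma}$ implied by Equation (3), initializes $u$ and $v$ to ones instead of
random values, runs a fixed number of iterations (`n_iter`, a hypothetical parameter) instead of testing convergence,
and scales the centered potential as $\gamma \log(u_a / \tilde u)$, the scaling that follows from the stationarity conditions of
Equation (3). The function name `sinkhorn_dual` is illustrative.

```python
import math

def sinkhorn_dual(p, q, D, gamma=0.1, n_iter=1000):
    """Return (alpha, u, v) for positive weight lists p, q and a cost matrix D.

    alpha is the centered dual potential on the support of p.
    """
    m, n = len(p), len(q)
    K = [[math.exp(-D[a][b] / gamma) for b in range(n)] for a in range(m)]
    u, v = [1.0] * m, [1.0] * n  # assumption: ones instead of a random start
    for _ in range(n_iter):
        u = [p[a] / sum(K[a][b] * v[b] for b in range(n)) for a in range(m)]
        v = [q[b] / sum(K[a][b] * u[a] for a in range(m)) for b in range(n)]
    log_u = [math.log(ua) for ua in u]
    mean_log_u = sum(log_u) / m  # logarithm of the geometric mean of u
    alpha = [gamma * (lu - mean_log_u) for lu in log_u]
    return alpha, u, v

# Two source points and a single target point: all mass must travel to the target.
p, q = [0.5, 0.5], [1.0]
D = [[0.0], [1.0]]
alpha, u, v = sinkhorn_dual(p, q, D, gamma=0.1)
print(alpha)
```

In this example the transport plan is forced, and the potential comes out close to $(-0.5, 0.5)$: a unit of mass at
the second point costs one more than a unit at the first, which matches the sensitivity interpretation of $\alpha^\star$. For
small $\gamma$ and large costs the entries of $K$ can underflow, in which case a log-domain variant would be preferable.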

3 Wasserstein Training of a Restricted Boltzmann Machine 

The restricted Boltzmann machine (RBM) is a generative model of binary data that is composed of $d$ binary
observed variables and $h$ binary explanatory variables. The vector $x \in \{0,1\}^d$ represents the state of observed
variables, and the vector $y \in \{0,1\}^h$ represents the state of explanatory variables. The RBM associates to each
configuration $x$ of observed variables a probability $p_\theta(x)$ defined as

$$p_\theta(x) = \frac{\sum_{y} e^{-E_\theta(x, y)}}{\sum_{x', y} e^{-E_\theta(x', y)}},$$

where $E_\theta(x, y) = -a^T x - \sum_{j=1}^{h} y_j (w_j^T x + b_j)$ is called the energy and $\theta = (a, \{w_j, b_j\}_{j=1}^{h})$ are the parameters
of the RBM. These parameters must be learned from the data. Knowing the state $x$ of the observed variables, the
explanatory variables are independent Bernoulli-distributed with $\Pr(y_j = 1 \mid x) = \sigma(w_j^T x + b_j)$, where $\sigma$ is the
logistic map $z \mapsto (1 + e^{-z})^{-1}$. Conversely, knowing the state $y$ of the explanatory variables, the observed variables
on which the probability distribution is defined can also be sampled independently, leading to an efficient alternate
Gibbs sampling procedure for $p_\theta$. In this RBM model, explanatory variables can be analytically marginalized,
allowing us to rewrite the probability model as $p_\theta(x) = \frac{1}{Z_\theta} e^{-F_\theta(x)}$, where
$F_\theta(x) = -a^T x - \sum_{j=1}^{h} \log(1 + \exp(w_j^T x + b_j))$ is the free energy associated to this model.
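The marginalization and the alternate Gibbs step can be made concrete with a small standard-library sketch. It is
an illustration under assumptions, not the code used in the experiments: the weights are stored as a list `W` of $h$
rows $w_j$, the parameter values are arbitrary, and the helper names (`free_energy`, `energy`, `gibbs_step`) are
hypothetical. The example checks on a tiny model that the closed-form free energy agrees with a brute-force sum over
all states of the explanatory variables.

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def pre_activations(x, W, b):
    # z_j = w_j^T x + b_j, with W stored as a list of h rows of length d.
    return [sum(wi * xi for wi, xi in zip(w, x)) + bj for w, bj in zip(W, b)]

def free_energy(x, a, W, b):
    # F(x) = -a^T x - sum_j log(1 + exp(w_j^T x + b_j))
    visible_term = sum(ai * xi for ai, xi in zip(a, x))
    return -visible_term - sum(softplus(z) for z in pre_activations(x, W, b))

def energy(x, y, a, W, b):
    # E(x, y) = -a^T x - sum_j y_j (w_j^T x + b_j)
    z = pre_activations(x, W, b)
    visible_term = sum(ai * xi for ai, xi in zip(a, x))
    return -visible_term - sum(yj * zj for yj, zj in zip(y, z))

def gibbs_step(x, a, W, b, rng):
    # Sample y given x, then x given y; both conditionals factorize over units.
    y = [1 if rng.random() < sigmoid(z) else 0 for z in pre_activations(x, W, b)]
    new_x = []
    for i in range(len(a)):
        activation = a[i] + sum(y[j] * W[j][i] for j in range(len(W)))
        new_x.append(1 if rng.random() < sigmoid(activation) else 0)
    return new_x

# Tiny example with d = 3 observed and h = 2 explanatory variables.
a = [0.1, -0.4, 0.2]
W = [[0.5, -1.0, 0.3], [-0.2, 0.7, 1.1]]
b = [0.0, -0.3]
x = [1, 0, 1]
brute = -math.log(sum(math.exp(-energy(x, y, a, W, b))
                      for y in itertools.product([0, 1], repeat=len(b))))
print(free_energy(x, a, W, b), brute)  # the two values should agree
sample = gibbs_step(x, a, W, b, random.Random(0))
```

Repeating `gibbs_step` from the previous sample, rather than restarting from data, is the idea behind a persistent
chain such as the PCD sample used for $\hat p_\theta$.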

Wasserstein Gradient of the RBM. Having written the RBM in its free energy form, the Wasserstein gradient
can be obtained by computing the gradient of $F_\theta(x)$ and injecting it in Equation (6):

$$\nabla_{w_j} W_\gamma(\hat p_\theta, \hat p) = \langle \alpha^\star(x) \, \sigma(z_j) \, x \rangle_{\hat p_\theta},$$

where $z_j = w_j^T x + b_j$. Gradients with respect to parameters $a$ and $\{b_j\}_j$ can also be obtained by the same means.
In comparison, the gradient of the KL divergence is given by
$\nabla_{w_j} \mathrm{KL}(\hat p \,\|\, p_\theta) = \langle \sigma(z_j) x \rangle_{\hat p_\theta} - \langle \sigma(z_j) x \rangle_{\hat p}$. While
the Wasserstein gradient can, in the same way as the KL gradient, be expressed in a very simple form, the former is
not sum-decomposable. A simple manifestation of the non-decomposability occurs for $\tilde N = 1$ (smallest possible
sample size): In that case, $\alpha(\tilde x_n) = 0$ due to the centering constraint (see Section 2), thus making the gradient
zero.
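As a hedged illustration of the formula above, the following sketch accumulates the sample-based Wasserstein gradients
for all RBM parameters. The expressions for $a$ and $b_j$ are not written out in the text; they are derived here from
Equation (6) with $\nabla_a F_\theta(x) = -x$ and $\partial F_\theta(x) / \partial b_j = -\sigma(z_j)$, and should be read as our derivation. The
function name `wasserstein_gradients` and the data layout (lists of lists) are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wasserstein_gradients(samples, alpha, W, b):
    """Sample-based gradients of W_gamma w.r.t. a, W (rows w_j) and b.

    samples: binary vectors drawn from the model (e.g. a PCD sample).
    alpha:   centered dual potentials, one per sample.
    """
    n, d, h = len(samples), len(samples[0]), len(W)
    grad_a = [0.0] * d
    grad_W = [[0.0] * d for _ in range(h)]
    grad_b = [0.0] * h
    for x, al in zip(samples, alpha):
        for i in range(d):
            grad_a[i] += al * x[i] / n                 # <alpha x>
        for j in range(h):
            z = sum(W[j][i] * x[i] for i in range(d)) + b[j]
            s = sigmoid(z)
            grad_b[j] += al * s / n                    # <alpha sigma(z_j)>
            for i in range(d):
                grad_W[j][i] += al * s * x[i] / n      # <alpha sigma(z_j) x>
    return grad_a, grad_W, grad_b

# Two generated samples with opposite potentials, one explanatory unit.
samples = [[1, 0], [0, 1]]
alpha = [0.5, -0.5]
grad_a, grad_W, grad_b = wasserstein_gradients(samples, alpha, [[0.0, 0.0]], [0.0])
print(grad_a, grad_W, grad_b)

# With a single sample the centered potential is zero, so every gradient vanishes.
single = wasserstein_gradients([[1, 1]], [0.0], [[0.3, -0.2]], [0.1])
```

A gradient descent step on $W_\gamma$ subtracts these quantities from the parameters, which lowers the free-energy-based
probability of samples with a high potential and raises it for samples with a low one.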

Stability and KL Regularization. Unlike the KL gradient, the Wasserstein gradient is an average over the
generated sample $\hat p_\theta$ only; the data distribution $\hat p$ enters only through the weights $\alpha^\star$. This is a problem when the
sample $\hat p_\theta$ generated by the model strongly differs from the examples coming from $\hat p$, because there is no weighting
$(\alpha(\tilde x_n))_n$ of the generated sample that can represent the desired direction in $\Theta$. In that case, the Wasserstein
gradient will point to a bad local minimum. Closeness between the two empirical samples from this optimization
perspective can be ensured by adding a regularization term to the objective of the form

$$\Omega(\theta) = \mathrm{KL}(\hat p \,\|\, p_\theta) + \eta \cdot \Big( \|a\|^2 + \sum_j \|w_j\|^2 \Big).$$

It incorporates the usual quadratic containment term, but more importantly, the KL term, which forces proximity to
$\hat p$ due to the direct dependence of its gradient on it. The optimization problem becomes:

$$\min_{\theta \in \Theta} \; W_\gamma(\hat p_\theta, \hat p) + \lambda \cdot \Omega(\theta),$$

starting at point $\theta_0 = \arg\min_{\theta \in \Theta} \Omega(\theta)$, and where $\lambda, \eta$ are two regularization hyperparameters that must be
selected. Determining the starting point $\theta_0$ is analogous to performing an initial pretraining step. Thus, the proposed
Wasserstein procedure can also be seen as finetuning a standard RBM, and forcing the finetuning not to deviate too
much from the initial solution.


4 Experiments 

We perform several experiments that demonstrate that Wasserstein-trained RBMs learn distributions that are better
from a metric perspective. First, we explore the main characteristics of a learned distribution that optimizes
the Wasserstein objective. Then, we investigate the usefulness of these learned models on practical problems
such as data completion and denoising, where the metric between observations occurs in the performance evaluation.
We use two datasets: The first one is the MNIST dataset [?] consisting of 60000 handwritten digits of size
28 × 28. For the purpose of our experiments, the images are downsized to 14 × 14 pixels, and binarized with
the mean pixel value of each individual pixel as threshold. We focus on modeling digits of class “0”. There are
5923 such examples. The second dataset is the UCI PLANTS dataset [11], which associates to each plant species
a 70-dimensional binary vector indicating its presence or absence in each US state or Canadian province. For
the purpose of modeling a smooth-looking data distribution, too frequent or infrequent plants are discarded with
probability 1 — e -( 3 0-°- 5 )) where v G [0,1] is the plant occurrence frequency. This results in a dataset of 6539 
plants. The two datasets are then randomly partitioned in three equal-sized subsets used for training, validation 
and test. 
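The per-pixel binarization described for MNIST can be sketched as follows. This is only an illustration: the
downsizing step is not shown, images are assumed to be flat lists of intensities, the function name
`binarize_per_pixel` is hypothetical, and the choice of a strict inequality at the threshold is an assumption, since
the text does not say how ties are handled.

```python
def binarize_per_pixel(images):
    """Binarize images using each pixel's mean over the dataset as threshold.

    images: list of equally sized flat lists of pixel intensities.
    """
    n = len(images)
    n_pixels = len(images[0])
    means = [sum(img[i] for img in images) / n for i in range(n_pixels)]
    # Assumption: a pixel is set to 1 when it is strictly above its mean.
    return [[1 if img[i] > means[i] else 0 for i in range(n_pixels)]
            for img in images]

# Three toy "images" of two pixels each.
images = [[0.0, 1.0], [1.0, 1.0], [0.2, 0.0]]
binary = binarize_per_pixel(images)
print(binary)
```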

4.1 Training, Validation and Evaluation 

All RBM models that we investigate are trained using for $\hat p_\theta$ the PCD approximation [16] of $p_\theta$, where the sample is
refreshed at each gradient update by one step of alternate Gibbs sampling, starting from the sample at the previous
time step. We choose a PCD sample of same size as the training set ($\tilde N = N$). The coefficients $\alpha_1, \dots, \alpha_{\tilde N}$
occurring in the Wasserstein gradient are obtained by solving the smoothed Wasserstein dual between $\hat p_\theta$ and
$\hat p$, with smoothing parameter $\gamma = 0.1$ and distance $D(x, x') = H(x, x') / \langle H(x, x') \rangle_{\hat p}$, where $H$ denotes the
Hamming distance between two binary vectors. We use the centered parameterization of the RBM for gradient
descent [?]. The learning rate is set heuristically to 0.01 ($\lambda = \infty$) during the pretraining phase and modified to
$0.01 \min(1, \lambda^{-1})$ when training on the final objective. We perform holdout validation on the quadratic containment
coefficient $\eta \in \{10^{-4}, 10^{-3}, 10^{-2}\}$, and on the KL weighting coefficient $\lambda \in \{0, 10^{-1}, 10^{0}, 10^{1}, \infty\}$. The number
of hidden units of the RBM is set heuristically to 400 for both datasets. In our experiments, the likelihood term
of the KL divergence is evaluated by estimating the partition function $Z_\theta$ using annealed importance sampling (AIS)
with 100 examples annealed in 1000 steps of increasingly small temperature differences. The Wasserstein distance
$W_\gamma(\hat p_\theta, \hat p)$ is computed between the whole test distribution and the PCD sample at the end of the training procedure.
This sample is a fast approximation of the true unbiased sample, which would otherwise have to be generated by
annealing or enumeration of the states.
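The normalized Hamming cost used in the transport problem can be illustrated with a short sketch. The text only
states that $H$ is divided by its expectation under $\hat p$; the reading taken here, that the normalizer is the mean of $H$
over all ordered pairs of data points (including a point paired with itself), is an assumption, and the function
names are hypothetical.

```python
def hamming(x, y):
    # Number of positions where two binary vectors differ.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def mean_pairwise_hamming(data):
    # Assumption: average of H over all ordered pairs of data points,
    # including pairs made of a point and itself.
    n = len(data)
    return sum(hamming(x, y) for x in data for y in data) / (n * n)

def cost_matrix(model_sample, data):
    """D[a][b] = H(model_sample[a], data[b]) divided by the mean data distance."""
    scale = mean_pairwise_hamming(data)
    return [[hamming(x, y) / scale for y in data] for x in model_sample]

data = [[0, 0], [1, 1]]
D = cost_matrix([[0, 1], [0, 0]], data)
print(D)
```

With this normalization a cost of 1 corresponds to a typical distance between two data points, which makes a
smoothing value such as $\gamma = 0.1$ comparable across datasets of different dimension. The normalizer costs a quadratic
number of Hamming evaluations and would typically be computed once.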

4.2 Results and Analysis 

The contour plots of Figure 2 show the effect of hyperparameters $\lambda$ and $\eta$ on the KL divergence and the Wasserstein
distance. For $\lambda = \infty$, only the KL regularizer is active, which is equivalent to training a standard RBM. In that
case, we obtain a low KL divergence. As we reduce the amount of regularization, the Wasserstein distance becomes
effectively minimized and thus smaller. If $\lambda$ is chosen too small, the Wasserstein distance increases again, for the
stability reasons mentioned in Section 3. In all our experiments, we observed that KL pretraining was necessary
in order to reach low Wasserstein distance. Not doing so leads to degenerate solutions. The relation between
hyperparameters and minimization criteria is consistent across the two datasets: In both cases, the Wasserstein
RBM produces lower Wasserstein distance than a standard RBM.

Figure 2: Contour plots showing the measure of error as a function of the regularization hyperparameters $\lambda$ and $\eta$.
The best Wasserstein-trained RBMs (RBM-W) are shown in red. The best standard RBMs (i.e. with $\lambda$ forced to
$+\infty$) are shown in blue.

Figure 3: Two-dimensional PCA comparison of distributions learned by the RBM and the RBM-W. Plots are
obtained by projecting the learned distributions on the first components of the true distribution.

The PCA plots of Figure 3 superimpose on the true data distribution (in gray) the distributions generated by the
standard RBM (in blue) and the Wasserstein RBM (in red). In particular, the plots show the projected distributions
onto the first two PCA components of the true distribution. While the standard RBM distribution uniformly covers
the data, the one generated by the RBM-W consists of a finite set of small dense clusters that are scattered across
the input distribution. In other words, the Wasserstein model is biased towards these clusters, and systematically
ignores other regions. Although the KL-generated distributions shown in blue may look better (the red distribution
strongly departs visually from the data distribution), the red distribution is actually superior if considering the
smooth Wasserstein distance as a performance metric, as shown in Figure 2.

Samples generated by the standard RBM and the Wasserstein RBM (more precisely their PCD approximation)
are shown in Figure 4. The RBM-W produces a reduced set of clean prototypical examples, with less noise than
those produced by a regular RBM. All handwritten digits generated by RBM-W have well-defined contours and a
round shape. However, they do not reproduce the variety of shapes present in the data. Similarly, the plant species
territorial spreads generated by the RBM-W form compact and contiguous regions that are prototypical of real
spreads, but are also less diverse than the data or the sample generated by the standard RBM.

4.3 Application to Data Completion and Denoising 

In order to demonstrate the practical relevance of Wasserstein distance minimization, we apply the learned models
to the task of data completion and data denoising, for which the use of a metric is crucial: Data completion and
data denoising performance is generally measured in terms of distance between the true data and the completed or
denoised data (e.g. Euclidean distance for real-valued data, or Hamming distance $H$ for binary data). Remotely
located probability mass that may result from simple KL minimization would incur a severe penalty on the
completion and denoising performance metric. Both tasks have useful practical applications: Data completion can be
used as a first step when applying discriminative learning (e.g. neural networks or SVM) to data with missing
features. Data denoising can be used as a dimensionality reduction step before training a supervised model. Let
the input $x = [v, h]$ be composed of $d - k$ visible variables $v$ and $k$ hidden variables $h$.

Figure 4: Samples of the MNIST and PLANTS dataset, and samples generated by the standard and the
Wasserstein RBMs. (Images for the PLANTS data are automatically generated from the Wikimedia Commons
template https://commons.wikimedia.org/wiki/File:BlankMap-USA-states-Canada-provinces.svg
created by user Lokal_Profil.)


Data Completion. The setting of the data completion experiment is illustrated in Figure 5 (top). The distribution
$p_\theta(x \mid v)$ over possible reconstructions can be sampled from using an alternate Gibbs sampler, or by enumeration.
The expected Hamming distance between the true state $x^\star$ and the reconstructed state modeled by the distribution
$p_\theta(x \mid v)$ is given by iterating on the $2^k$ possible reconstructions: $\mathcal{E} = \sum_{h \in \{0,1\}^k} p_\theta(x \mid v) \cdot H(x, x^\star)$. Since the
reconstruction is a probability distribution, we can compute the expected Hamming error, but also its bias-variance
decomposition. On MNIST, we hide randomly located image patches of size 3 × 3 (i.e. $k = 9$). On PLANTS,
we hide random subsets of $k = 9$ variables. Results are shown in Figure 6 (left), where we compare three types
of models: Kernel density estimation (KDE), standard RBM (RBM) and Wasserstein RBM (RBM-W). The KDE
estimation model uses a Gaussian kernel, with the Gaussian scale parameter chosen such that the KL divergence
of the model from the validation data is minimized. The RBM-W is better than or comparable to the other models. Of
particular interest is the structure of the expected Hamming error: For the standard RBM, a large part of the error
comes from the variance (or entropy), while for the Wasserstein RBM, the bias term contributes the most. This
can be related to what is observed in Figure 3: For a data point outside the area covered by the red points, the
reconstruction is systematically redirected towards the nearest red cluster, thus incurring a systematic bias.
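The enumeration over the $2^k$ reconstructions can be sketched as below. This is a minimal illustration, not the
evaluation code behind Figure 6: it takes any free energy as a callable, assumes the hidden variables are stored after
the visible ones in the state vector, and uses the fact that $p_\theta(x \mid v)$ is proportional to $e^{-F_\theta(x)}$ over the
candidates. The name `expected_completion_error` is hypothetical, and the bias-variance decomposition is not computed.

```python
import itertools
import math

def hamming(x, y):
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def expected_completion_error(free_energy, v, h_true):
    """Expected Hamming error of the completion p(x | v), enumerating all h.

    free_energy: callable returning F(x) for a full binary state x = v + h.
    """
    k = len(h_true)
    candidates = [list(h) for h in itertools.product([0, 1], repeat=k)]
    energies = [free_energy(list(v) + h) for h in candidates]
    shift = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - shift)) for e in energies]
    total = sum(weights)
    # H(x, x*) reduces to H(h, h*) because the visible part v is shared.
    return sum(w / total * hamming(h, h_true)
               for w, h in zip(weights, candidates))

# A flat free energy makes every completion equally likely.
v, h_true = [1, 0], [0, 1, 1]
flat_error = expected_completion_error(lambda x: 0.0, v, h_true)
print(flat_error)

# A free energy that strongly favors the true completion gives a small error.
sharp_error = expected_completion_error(
    lambda x: 50.0 * hamming(x[len(v):], h_true), v, h_true)
```

With a flat free energy each hidden bit is wrong with probability one half, so the expected error is $k/2$, which
gives a simple reference point for the values reported in the experiments.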

Data Denoising. Here, we consider a simple noise process where, for a predefined subset of $k$ variables denoted
by $h$, a known number $l$ of bit flips occurs randomly. The remaining $d - k$ variables are denoted by $v$. The setting
of the experiment is illustrated in Figure 5 (bottom). Denoting by $x^\star$ the original and by $\tilde x$ its noisy version resulting
from flipping $l$ variables of $h$, the expected Hamming error is given by iterating over the states $x$ with the same
visible variables $v$ that are at distance $l$ of $\tilde x$: $\mathcal{E} = \sum_{h \in \{0,1\}^k} p_\theta(x \mid v, H(x, \tilde x) = l) \cdot H(x, x^\star)$. Note that the
original example $x^\star$ is necessarily part of this set of states under the noise model assumption. For the MNIST data,
we choose randomly located image patches of size 4 × 3 or 3 × 4 (i.e. $k = 12$), and generate $l = 4$ random bit
flips within the selected patch. For the PLANTS data, we generate $l = 4$ bit flips in $k = 12$ randomly preselected
input variables. Figure 6 (right) shows the denoising error in terms of expected Hamming distance on the same two
datasets. The RBM-W is better than or comparable to the other models. As for the completion task, the main difference
between the two RBMs is the bias/variance ratio, where again the Wasserstein RBM tends to have larger bias. This
experiment has considered a very simple noise model consisting of a fixed number $l$ of random bit flips over a
small predefined subset of variables. Denoising highly corrupted complex data will however require combining
Wasserstein models with more flexible noise models such as the ones proposed by [?].
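The denoising error can be sketched in the same spirit as the completion error, with the enumeration restricted to
the states at Hamming distance $l$ from the noisy input. As before, this is an illustration under assumptions: the
free energy is passed as a callable, the corrupted variables are stored after the visible ones, and the name
`expected_denoising_error` is hypothetical.

```python
import itertools
import math

def hamming(x, y):
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def expected_denoising_error(free_energy, v, h_noisy, h_true, n_flips):
    """Expected Hamming error over states at distance n_flips from the noisy h."""
    k = len(h_noisy)
    candidates = []
    for flipped in itertools.combinations(range(k), n_flips):
        h = list(h_noisy)
        for i in flipped:
            h[i] = 1 - h[i]  # undo a hypothesized bit flip
        candidates.append(h)
    energies = [free_energy(list(v) + h) for h in candidates]
    shift = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - shift)) for e in energies]
    total = sum(weights)
    return sum(w / total * hamming(h, h_true)
               for w, h in zip(weights, candidates))

# One bit of a three-bit block was flipped; a flat model cannot tell which one.
v, h_true, h_noisy = [1], [0, 0, 0], [1, 0, 0]
flat_error = expected_denoising_error(lambda x: 0.0, v, h_noisy, h_true, 1)
print(flat_error)
```

In this toy case the three candidates are the original block and two blocks at distance 2 from it, so a flat model
yields an expected error of 4/3. The enumeration has $\binom{k}{l}$ terms, i.e. 495 for $k = 12$ and $l = 4$.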
Figure 5: Illustration of the completion and denoising setup. For each image, we select a known subset of pixels,
that we hide (or corrupt with noise). Each possible reconstruction has a particular Hamming distance to the original
example. The expected Hamming error is computed by weighting the Hamming distances by the probability that
the model assigns to the reconstructions.

Figure 6: Performance on the completion and denoising tasks of the kernel density estimation, the standard RBM
and the Wasserstein RBM. The total length of the bars is the expected Hamming error. Dark gray and light gray
sections of the bars give the bias-variance decomposition.


5 Conclusion 

We have introduced a new objective for Boltzmann machines based on the smooth Wasserstein distance. Unlike 
the usual Kullback-Leibler (KL) divergence, our objective takes into account the metric of the data. The objective 
admits a simple gradient, that can be computed by solving the dual of the Wasserstein distance between the learned 
and observed distributions. We learned a Wasserstein model on two simple problems: In both cases, the learned 
distributions strongly departed from the KL model, and formed instead a set of clusters of prototypical examples 
(well-shaped digits for MNIST, and contiguous territorial spreads for PLANTS). 

We have evaluated the Wasserstein RBM on two basic completion and denoising tasks, for which the metric of 
the data intervenes in the performance evaluation. In this simple setting, we have demonstrated the superiority of 
the RBM-W over the standard RBM, and how the bias-variance structure of the estimator systematically differs. 
Our contribution aims principally at introducing a novel type of objective for the Boltzmann machine where it did 
not exist before, and showing that Boltzmann machines and Wasserstein methods can be combined. In particular, 
our work gives an additional practical motivation for developing Wasserstein methods that run quickly on large 
datasets. 